Scene segmentation is an essential step in a wide range of video processing 
applications, for instance, object recognition and tracking. The Gaussian 
mixture model (GMM) for background subtraction (BS) has gained 
widespread usage in scene segmentation, despite its known computational 
intensity. To tackle this challenge, we propose a practical solution to 
accelerate processing through a parallel implementation on an embedded 
multicore platform. In this paper, we present an improved automated parallel 
implementation of the GMM algorithm using the Orphan directive provided 
by open multiprocessing (OpenMP). Experimental assessments conducted 
on the eight cores of the C6678 digital signal processor (DSP) demonstrate 
significant advancements in parallel efficiency, particularly when handling 
high-resolution frames, including high-definition (HD) and full-HD 
resolutions. The achieved parallel efficiency surpasses the results obtained 
with the classical OpenMP scheduling modes (dynamic, static, and guided). 
Specifically, the parallel efficiency reaches approximately 82% for full-HD 
resolution frames and 99.3% for low-resolution frames. 

1. INTRODUCTION 


At the core of the moving object detection process, background subtraction (BS) is considered a 
critical step responsible for extracting moving objects [1]. Over the past two decades, this field has 
experienced significant algorithmic advancements, with researchers proposing various techniques that differ 
in how they generate and update the background model. A comprehensive literature review of 
these techniques is presented by Garcia-Garcia et al. [2]. According to [2], BS algorithms can be classified 
into four distinct categories, based on the characteristics of the background pattern: i) mathematical concepts: 
fuzzy models [3], statistical models [4] and Dempster-Shafer models [5]; ii) machine learning techniques: 
support vector machines [6] and neural networks [7]; iii) signal processing techniques: Wiener filter [8] and 
Kalman filter [9]; and iv) classification models: clustering algorithms [10]. 

While recent BS methods provide satisfactory accuracy, the use of complex models is 
computationally demanding. An efficient algorithm should balance accuracy with a low computational load. 
According to [11], the Gaussian mixture model (GMM) is a promising candidate for addressing these constraints, 
as it represents an optimal compromise between accuracy and performance. Consequently, researchers have 
proposed statistical methods to manage and overcome challenges in background scenes, such as illumination 
changes and noise. For real-world applications, researchers have employed basic techniques like GMM [4], codebook 
[10] and visual background extractor (ViBe) [12]. This choice is driven mainly by the memory and time 
requirements of newer BS methods [2]. 

In the literature, GMM is acknowledged as one of the predominant statistical methods [4], [13], 
[14]. Stauffer and Grimson [4] presented an adaptive model for real-time tracking, where each pixel is 
characterized by a mixture of Gaussians. This representation is dynamically updated in real-time through the 
incorporation of new input frames. Zivkovic [15] enhanced the GMM algorithm by proposing dynamic 
updates of K Gaussians for each pixel. As a result, K is adjusted dynamically to the multimodality of every 
pixel in accordance with scene evolution. 

The fields of image and video processing have experienced a significant surge in challenges and 
complexity to meet the demands of real-time applications. This is explained by the market demand for 
high-resolution images (e.g., full high-definition (full-HD) 1920x1080 and HD 1280x720) in various 
application areas, including the detection of traffic violations, surveillance of national borders, and 
monitoring of critical government infrastructure. Consequently, video processing has become both bandwidth 
and computationally intensive. To address this limitation, parallel processing techniques are 
essential to achieve high computational performance and fulfill real-time requirements. In this paper, the 
computational platform chosen is the multicore C6678 digital signal processor (DSP) from Texas Instruments 
(TI), selected for its advantageous features, including high computing performance and low power 
consumption [16]. 

Over the years, several studies have examined automated parallel implementations based on open 
multiprocessing (OpenMP) for the GMM BS algorithm, with the aim of enhancing its computational 
performance and parallel efficiency. Szwoch et al. [17] suggested a parallel implementation of the GMM BS 
using a supercomputer comprising 192 nodes connected with an InfiniBand network. Each computing node 
consisted of two six-core CPUs in the Xeon EM64T architecture. OpenMP was used, and both static and 
dynamic scheduling techniques were evaluated. The achieved speedup did not exceed 3.75 for 
medium-resolution frames and 2.7 for HD resolution frames when twelve threads were used. 
Mabrouk et al. [18] proposed a parallel implementation of the GMM BS on a multicore platform, which 
included two Intel Xeon(R) CPU E5-2670 8-core processors. The distribution of processing across the 
multicore platform was accomplished through the application of OpenMP, resulting in a speedup of 11.6 for 
the HD resolution frame when sixteen cores were enabled. In our previous work [19], we evaluated OpenMP 
classical scheduling (OCS) modes (e.g., dynamic, static, and guided), and found that only dynamic 
scheduling provided a high speedup compared to other scheduling modes, such as guided and static. The 
maximum speedup achieved with eight enabled cores was 3.6 for HD resolution frames. 
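
The speedups quoted above can be put on a common scale by converting them to parallel efficiency. The following worked calculation assumes the usual definition, efficiency equals speedup divided by the number of cores (or threads) used; the source reports efficiency figures but does not restate the formula, so the definition and the derived percentages are an assumption on our part rather than values reported in [17]-[19].

```latex
% Assumed definition: E = S / p, with S the speedup and p the number of cores or threads.
\begin{align*}
E &= \frac{S}{p} \times 100\% \\
E_{[17],\ \text{medium}} &= \frac{3.75}{12} \times 100\% = 31.25\% \\
E_{[17],\ \text{HD}}     &= \frac{2.7}{12}  \times 100\% = 22.5\%  \\
E_{[18],\ \text{HD}}     &= \frac{11.6}{16} \times 100\% = 72.5\%  \\
E_{[19],\ \text{HD}}     &= \frac{3.6}{8}   \times 100\% = 45\%
\end{align*}
```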

The main contribution of this paper is the improvement of the parallel efficiency of the GMM BS algorithm 
on a multicore DSP platform. This is achieved by selecting a suitable OpenMP directive: the OpenMP orphan 
directive (OOD). Indeed, the OOD approach proves particularly advantageous, as it simplifies the task of 
implementing coarse-grain parallel algorithms [20], in which very large program regions are parallelized. 
The overall results demonstrate a significant improvement in speedup, even in the case of full-HD and HD 
resolution frames. The paper is structured as follows: section 2 introduces the GMM BS algorithm, describes 
the experimental setup, and outlines the proposed parallel implementation approach. Section 3 presents the 
experimental findings. Finally, a conclusion is provided in section 4. 


2. MATERIAL AND METHOD 
2.1. Gaussian mixture model for background subtraction 

GMM has gained prominence in the field of BS. The pioneering work of Friedman and Russell [21] 
introduced a probabilistic model, wherein each pixel was characterized by a weighted sum of a limited 
number of Gaussian distributions. Subsequently, Stauffer and Grimson [4] made significant contributions by 
presenting an advanced GMM that accommodates K Gaussian models per pixel, where K typically takes a value 
in the range of 3 to 5 [4]. This advancement marked a significant stride in refining the GMM technique for BS. 
The probability associated with the current pixel value is given by (1), which underscores 
the inherent probabilistic foundation of this methodology: 


$$P(X_t) = \sum_{i=1}^{K} \omega_{i,t}\, \eta(X_t, \mu_{i,t}, \Sigma_{i,t}) \tag{1}$$

where: $\Sigma_{i,t}$ denotes the covariance matrix; $\mu_{i,t}$ the mean value; $\omega_{i,t}$ represents the weight of the $i$-th Gaussian in 
the mixture at time $t$; and $\eta(X_t, \mu_{i,t}, \Sigma_{i,t})$ specifies the Gaussian probability density function. 

A comparison is conducted between the incoming pixel and each Gaussian of the mixture; a match is 
declared when the pixel lies within 2.5 standard deviations of a component. Two distinct scenarios are 
encountered: scenario 1: a match is established, prompting the adjustment of both the mean and the variance 
for the corresponding Gaussian distribution; and scenario 2: no match is identified, and the new incoming 
pixel substitutes the least probable component within the mixture. 
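
The matching test can be sketched in C as follows. This is a minimal illustration for a single-channel (grayscale) pixel, not the authors' DSP code; the `Gaussian` struct and the `find_match` function are hypothetical names introduced here. The comparison is done on squared quantities so that no square root is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-pixel mixture component (grayscale case). */
typedef struct {
    float mean;
    float variance;   /* sigma squared */
    float weight;
} Gaussian;

/* Returns the index of the first component whose mean lies within
 * 2.5 standard deviations of the pixel, or -1 when none matches.
 * (pixel - mean)^2 <= 2.5^2 * variance is equivalent to
 * |pixel - mean| <= 2.5 * sigma. */
int find_match(const Gaussian *g, int k, float pixel)
{
    assert(g != NULL && k > 0);
    for (int i = 0; i < k; ++i) {
        float d = pixel - g[i].mean;
        if (d * d <= 6.25f * g[i].variance)
            return i;
    }
    return -1;   /* scenario 2: caller replaces the least probable component */
}
```

A return value of -1 corresponds to scenario 2, in which the caller would replace the least probable component with one centered on the new pixel.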

Equation (2) outlines the process for updating the weights of the K distributions. 

$$\omega_{k,t} = (1-\alpha)\,\omega_{k,t-1} + \alpha\, M_{k,t} \tag{2}$$

where: $\alpha$ is the learning rate and $M_{k,t}$ equals 1 for the matched model, and 0 otherwise. 
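
As an illustration of the weight update in (2), the sketch below applies it to the K weights of one pixel and then renormalizes them so that they sum to one. The function name `update_weights` and the convention that `matched` is -1 when no component matches are assumptions made for this example; they do not come from the source.

```c
#include <assert.h>

/* Applies the weight update of equation (2) to the k weights of one pixel,
 * then renormalizes so that the weights sum to 1.
 * matched: index of the matched component, or -1 if none matched. */
void update_weights(float *w, int k, int matched, float alpha)
{
    assert(w && k > 0 && alpha > 0.0f && alpha < 1.0f);
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) {
        float m = (i == matched) ? 1.0f : 0.0f;   /* M_{k,t} */
        w[i] = (1.0f - alpha) * w[i] + alpha * m;
        sum += w[i];
    }
    for (int i = 0; i < k; ++i)
        w[i] /= sum;
}
```

With weights {0.5, 0.3, 0.2}, a match on the first component and a learning rate of 0.1, the updated weights are approximately {0.55, 0.27, 0.18}.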
The parameter updates of the matched distribution are defined by (3) and (4). 

$$\sigma_t^2 = (1-\rho)\,\sigma_{t-1}^2 + \rho\,(x_t - \mu_t)^2 \tag{3}$$

$$\mu_t = (1-\rho)\,\mu_{t-1} + \rho\, x_t \tag{4}$$

where: $x_t$ represents the new input frame pixel value; $\mu_{t-1}$ and $\sigma_{t-1}^2$ represent the last mean and variance values 
of the matched Gaussian. Equation (5) defines the second learning rate, denoted as $\rho$. 

$$\rho = \alpha\, \eta(x_t \mid \mu_k, \sigma_k) \tag{5}$$
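
A minimal C sketch of the mean and variance updates for the matched Gaussian is given here for a grayscale pixel. Two assumptions are made and should be checked against the original formulation: the second learning rate is passed in as a precomputed argument `rho` rather than evaluated from the Gaussian density, and the variance update uses the freshly updated mean. The names `Component` and `update_matched` are hypothetical.

```c
#include <assert.h>

/* Hypothetical matched component (grayscale case). */
typedef struct {
    float mean;
    float variance;
} Component;

/* Updates the matched component with the new pixel value x.
 * rho is the second learning rate, assumed to be computed by the caller. */
void update_matched(Component *c, float x, float rho)
{
    assert(c && rho >= 0.0f && rho <= 1.0f);
    c->mean = (1.0f - rho) * c->mean + rho * x;              /* mean update */
    float d = x - c->mean;                                   /* assumption: updated mean */
    c->variance = (1.0f - rho) * c->variance + rho * d * d;  /* variance update */
}
```

For example, a component with mean 100 and variance 16, updated with a pixel value of 110 and a rate of 0.5, ends with mean 105 and variance 20.5.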


The last step is the estimation of the background, which involves sorting the Gaussians 
by the $\omega/\sigma$ ratio. The first B ranked distributions, whose cumulative weight sum surpasses the 
specified threshold (Th), are identified as the background, as described by (6). 

$$B = \arg\min_{b} \left( \sum_{k=1}^{b} \omega_k > Th \right) \tag{6}$$

where: Th represents the minimum threshold of the background model. 
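
The background estimation step can be sketched as a sort followed by a cumulative sum. The example below is an illustration under stated assumptions: the names `Comp` and `select_background` are hypothetical, variances are assumed strictly positive, and the ranking key is the squared weight divided by the variance, which gives the same order as the weight-to-standard-deviation ratio for positive values while avoiding a square root.

```c
#include <assert.h>

/* Hypothetical mixture component; variance must be > 0. */
typedef struct {
    float weight;
    float variance;
} Comp;

/* Sorts the k components in place by descending weight/sigma and returns B,
 * the smallest number of leading components whose cumulative weight
 * exceeds th. Returns k if the threshold is never exceeded. */
int select_background(Comp *c, int k, float th)
{
    assert(c && k > 0);
    /* Insertion sort; K is small (typically 3 to 5). */
    for (int i = 1; i < k; ++i) {
        Comp key = c[i];
        float key_rank = key.weight * key.weight / key.variance;
        int j = i - 1;
        while (j >= 0 &&
               c[j].weight * c[j].weight / c[j].variance < key_rank) {
            c[j + 1] = c[j];
            --j;
        }
        c[j + 1] = key;
    }
    float cumulative = 0.0f;
    for (int b = 0; b < k; ++b) {
        cumulative += c[b].weight;
        if (cumulative > th)
            return b + 1;
    }
    return k;
}
```

With weights {0.2, 0.5, 0.3}, equal variances and a threshold of 0.7, the two heaviest components form the background model.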


2.2. TMS320C6678 evaluation module overview 

The TMS320C6678 evaluation module was used as the experimental platform, featuring a single 
C6678 DSP chip and 512 MB of dynamic random-access memory (DRAM). The C6678 chip 
comprises eight DSP cores, each operating at a clock frequency of 1 GHz and delivering a computing 
performance of sixteen giga floating-point operations per second (GFLOPS). Notably, the architecture of the 
C66x DSP cores is based on very long instruction word (VLIW) design [16], [22]. The memory structure of 
the C6678 DSP is hierarchically organized into various levels, with the on-chip memory (L1) representing 
level 1, ensuring expedited CPU access compared to the external memory. 

A detailed view of the TMS320C6678 DSP components is shown in Figure 1. A comprehensive 
functional block diagram of the C6678 DSP is depicted in Figure 1(a), while Figure 1(b) illustrates the 
C6678 evaluation module. These capabilities have inspired numerous research communities to develop real- 
time applications using this hardware platform [23]-[26]. Throughout our implementation, we utilized 
version 8.3.7 of the C6000 TI compiler. 


2.3. Parallelization method 

OpenMP is an application programming interface (API) that facilitates parallel 
programming on multicore platforms characterized by homogeneous processors and shared memory 
architectures [27], [28]. It simplifies parallel implementations by offering directives that 
indicate to the compiler the parallel regions within the code. Users must also select an appropriate 
scheduling technique to distribute processing tasks effectively among the cores, and this choice 
significantly influences overall performance. A deep understanding of the algorithm structure and of 
the nature of its loop workload is therefore a key factor in identifying a suitable OpenMP schedule. 
In the case of the GMM BS algorithm, the workload is considered irregular. This irregularity arises 
from the dynamic nature of the algorithm and its dependence on the complexity of the scene. It can be 
attributed to several factors, including: i) varying background complexity: different pixels in an image 
may have varying complexities in the background due to changes in lighting or object movements; and 
ii) adaptive model updating: GMM models need to be continuously updated based on the characteristics of 
the scene. Pixels with more dynamic backgrounds or variations require more frequent updates, resulting 
in a heavier workload. 
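
For reference, the three classical scheduling modes are selected with the `schedule` clause of the OpenMP loop construct. The sketch below is illustrative only: `per_pixel_work` is a hypothetical stand-in whose cost varies from pixel to pixel, not the GMM update itself, and the chunk size of 128 for the dynamic case simply mirrors the value mentioned later for [19]. The code gives the same output whether or not OpenMP is enabled at compile time.

```c
#include <assert.h>

/* Hypothetical stand-in for irregular per-pixel work: 1 to 8 rounds. */
static unsigned char per_pixel_work(unsigned char p)
{
    int rounds = 1 + (p & 7);
    unsigned v = p;
    for (int r = 0; r < rounds; ++r)
        v = (v * 5u + 1u) & 0xFFu;
    return (unsigned char)v;
}

/* Static: iterations are split into equal blocks before the loop starts. */
void process_static(const unsigned char *in, unsigned char *out, int n)
{
    assert(in && out);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        out[i] = per_pixel_work(in[i]);
}

/* Dynamic: idle threads fetch the next chunk of 128 iterations at run time. */
void process_dynamic(const unsigned char *in, unsigned char *out, int n)
{
    assert(in && out);
    #pragma omp parallel for schedule(dynamic, 128)
    for (int i = 0; i < n; ++i)
        out[i] = per_pixel_work(in[i]);
}

/* Guided: like dynamic, but chunk sizes shrink as the loop progresses. */
void process_guided(const unsigned char *in, unsigned char *out, int n)
{
    assert(in && out);
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i)
        out[i] = per_pixel_work(in[i]);
}
```

All three variants compute identical results; only the assignment of iterations to threads, and therefore the load balance and scheduling overhead, differs.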


Figure 1. Overview of the TMS320C6678 DSP: (a) functional block diagram of the TMS320C6678 DSP and 
(b) EVM6678 development board 


Due to the irregular nature of the GMM algorithm, we chose the OOD approach, which offers 
significant advantages in simplifying the implementation of coarse-grain parallel algorithms [20]. Coarse- 
grained parallelism proves to be a suitable strategy for irregular loop algorithms, where the workload per 
iteration exhibits substantial variations. This approach entails dividing the overall task into larger units of 
work, with each unit representing a significant portion of the total workload. Coarse-grained parallelism 
effectively accommodates the irregularities in workload, reducing synchronization overhead compared to 
fine-grained approaches. This makes it particularly effective for scenarios where the computational 
requirements of different iterations vary widely. Overall, the OOD empowers users with more nuanced 
control over parallelization, leading to enhanced performance and improved stability in parallel programs, 
especially in cases where nested parallel regions are involved. 

Algorithm 1 shows the pseudocode of our proposed implementation using the OOD approach. In 
this case, the “omp for” directive in the Background_SubtractorGMM function is an orphan 
directive: it lies outside the lexical extent of the “omp parallel” region opened in the main function, 
but it executes within that region when the function is called from it. The use of an orphan directive 
highlights a key aspect of our design strategy, which emphasizes the parallelization of the whole BS 
process. Figure 2 shows a sample from the CDnet 2012 highway sub-dataset [29]. Figure 2(a) presents the 
input frame, and Figure 2(b) illustrates the generated mask, derived from our enhanced parallel 
implementation of the GMM algorithm. 


Algorithm 1. Parallel implementation of GMM BS using OOD approach 
//Main function 
omp_set_num_threads(8); 
GMMModel defined by {Mean (μk), Weight (ωk) and Variance (σk)}; 
Perform all GMM model initialization; 
Get current input frame; 
Begin 
    #pragma omp parallel 
    Call Background_SubtractorGMM(In InputFrame, 
                                  InOut GMMModel, 
                                  Out MaskFrame); 
end function 

//Background_SubtractorGMM function 
#pragma omp for 
For each (PixelIndex := [0..SizeOfFrame[; PixelIndex++) 
Begin 
    pixel = InputFrame[PixelIndex]; 
    For each k Gaussian 
    Begin 
        diff[k] = abs(μk - pixel); 
        if (diff[k] < Tmatch) then 
            Update GMMModel {Mean (μk), Weight (ωk) and Variance (σk)}; 
        else 
            Update GMMModel {Weight (ωk)}; 
        end if 
    end for 
    Normalization of Weight (ωk); 
    For each k Gaussian 
    Begin 
        Rank and sort all Gaussians by the ratio ωk/σk; 
    end for 
    Retain the first B components whose cumulative weight is greater than threshold (Th); 
    if (pixel does not match background model) then 
        mark pixel in MaskFrame as foreground; 
    else 
        mark pixel in MaskFrame as background; 
    end if 
end for 
end function 
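
The structure of Algorithm 1 can be illustrated with a short C sketch of an orphaned work-sharing directive. This is not the authors' implementation: to stay brief, the per-pixel body is a simple frame-differencing threshold used as a stand-in for the GMM update, and the names `background_subtract` and `process_frame` are hypothetical. The thread count is set with the `num_threads(8)` clause rather than a runtime call, so the code also compiles and runs serially when OpenMP is disabled.

```c
#include <assert.h>

/* The "omp for" below is orphaned: this function contains no enclosing
 * "parallel" construct. The loop is shared among the threads of whichever
 * parallel region is active in the caller, and runs serially otherwise. */
void background_subtract(const unsigned char *frame,
                         const unsigned char *background,
                         unsigned char *mask, int n, int threshold)
{
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        int diff = frame[i] - background[i];
        if (diff < 0)
            diff = -diff;
        /* Stand-in for the GMM test: 255 = foreground, 0 = background. */
        mask[i] = (diff > threshold) ? 255 : 0;
    }
}

/* Caller: opens the parallel region; every thread enters the function and
 * the orphaned "omp for" divides the pixel range among them. */
void process_frame(const unsigned char *frame,
                   const unsigned char *background,
                   unsigned char *mask, int n, int threshold)
{
    assert(frame && background && mask && n >= 0);
    #pragma omp parallel num_threads(8)
    background_subtract(frame, background, mask, n, threshold);
}
```

Because the parallel region is opened once in the caller, a whole per-frame processing function can be placed inside it, which is the coarse-grain arrangement the OOD approach is meant to simplify.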

Figure 2. CDnet 2012 highway sub-dataset: (a) input frame and (b) generated mask [29] 


3. RESULTS AND DISCUSSION 

The new parallel approach proposed in this paper, based on OOD, was applied to different frame 
resolutions to validate its effectiveness. The results are presented in Figure 3(a) for low-resolution 
frames, Figure 3(b) for medium-resolution frames, Figure 4(a) for HD frames, and Figure 4(b) for full-HD 
frames. They demonstrate that our proposed method yields improved results compared to the OCS modes 
presented in prior work [19]. Among the OCS modes, dynamic scheduling with a chunk size of 128 provided 
the best speedup results in [19]. However, in the current work, our new approach based on OOD 
outperforms the OCS methods. 

Linear speedup was achieved for low-resolution frames, as shown in Figure 3(a). However, we 
observed a decrease in speedup for medium-resolution frames, as illustrated in Figure 3(b), starting from the 
seventh core. For HD and full-HD frames, a speedup decrease was noticed when using more than six 
activated cores, as illustrated in Figure 4(a) and Figure 4(b), respectively. This reduction in speedup can be 
attributed to the limited DRAM memory bandwidth: access to the DRAM is restricted to a single 
core at a time through a 64-bit interface [16]. 

By strategically aligning the OOD with the specific demands of the GMM BS algorithm, we 
succeeded in optimizing the algorithm's parallelization. As shown in Figure 5, the OOD provides the best 
parallel efficiency performance compared to the conventional OpenMP scheduling methods presented in 
[19]. The adoption of the OOD, subsequent to code reallocation, represents a strategic move to bolster the 
parallel efficiency of the GMM BS algorithm. 

Figure 3. Comparison of obtained speedup between OOD and OCS approaches in (a) 320x240 frame 
resolution and (b) 640x480 frame resolution 

Figure 4. Comparison of obtained speedup between OOD and OCS approaches in (a) 1280x720 frame 
resolution and (b) 1920x1080 frame resolution 


[Figure 5: parallel efficiency (%) versus frame resolution (320x240, 640x480, 1280x720, 1920x1080) for the OCS approach and the OOD approach] 


4. CONCLUSION 

In this paper, we demonstrated the parallel implementation of the GMM BS algorithm on a 
multicore DSP platform using OpenMP. After conducting a comprehensive analysis of the GMM BS 
algorithm's workload and structural intricacies, we recognized the need for an adaptive approach due to the 
irregular workload per pixel processing. In this context, the integration of the orphan directive from the 
OpenMP API played a crucial role in achieving optimal performance, surpassing alternative scheduling 
modes such as dynamic, static, and guided. Indeed, we enhanced the GMM BS algorithm's processing 
capabilities, resulting in significant improvements in parallel efficiency. Specifically, we achieved 82% 
parallel efficiency for full-HD resolution frames and a linear speedup (i.e., 99.3% parallel efficiency) for 
low-resolution frames when all eight DSP cores were enabled. Looking ahead, our future work aims to extend the 
parallel implementation of the GMM BS algorithm to 16 DSP cores, followed by the parallel implementation 
of a vehicle tracking processing chain. 
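
As a rough cross-check, the efficiency figures above can be converted back to speedups. The calculation assumes the usual relation between speedup and parallel efficiency on p cores; the resulting speedup values are derived here and are not quoted from the measurements.

```latex
% Assumed relation: S = E * p, with p = 8 DSP cores.
\begin{align*}
S_{\text{full-HD}}        &= 0.82  \times 8 = 6.56 \\
S_{\text{low-resolution}} &= 0.993 \times 8 = 7.944 \approx 7.94
\end{align*}
```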